Working with DataFrames

Both SimpleSDMLayers.jl and GBIF.jl offer an optional integration with the DataFrames.jl package. Therefore, our previous example with the kingfisher Megaceryle alcyon could also be approached with a DataFrame-centered workflow.

We will illustrate this using the same data and producing the same figures as in the previous example. To do so, we will use GBIF.jl to produce the occurrence DataFrame we will use throughout this example. However, it is also possible to use a DataFrame of your choosing instead of one generated by GBIF.jl, as long as it holds one occurrence per row, a column with the latitude coordinates, and a column with longitude coordinates. For the rest, it can hold whatever information you like. Most of our functions assume by default that the coordinates are stored in columns named :latitude and :longitude (the order doesn't matter), but you can generally specify other names with latitude = :lat in case you don't want to rename them (we will show you how below).

So let's start by getting our data:

# Load packages
using SimpleSDMLayers
using GBIF
using Plots
using Statistics
# Load DataFrames too
using DataFrames

# Load environmental data
temperature, precipitation = worldclim([1,12])

# Get GBIF occurrences
kingfisher = GBIF.taxon("Megaceryle alcyon", strict=true)
kf_occurrences = occurrences(kingfisher,
                             "hasCoordinate" => "true",
                             "decimalLatitude" => (0.0, 65.0),
                             "decimalLongitude" => (-180.0, -50.0),
                             "limit" => 200)
for i in 1:4
  occurrences!(kf_occurrences)
end
@info kf_occurrences
[ Info: DataFrames integration loaded
[ Info: Loading DataFrames support for GBIF.jl
[ Info: GBIF records: downloaded 1000 out of 100000

Once the data is loaded, we can easily convert the environmental layers to a DataFrame with the corresponding coordinates. We can do this for a single layer or for multiple layers at the same time:

# Single layer
temperature_df = DataFrame(temperature)
# Multiple layers
env_layers = [temperature, precipitation]
env_df = DataFrame(env_layers)
rename!(env_df, :x1 => :temperature, :x2 => :precipitation)
first(env_df, 5)

5 rows × 4 columns

longitudelatitudetemperatureprecipitation
Float64Float64Union…Union…
1-179.917-89.9167-31.017143.0
2-179.917-89.75-30.391940.0
3-179.917-89.5833-33.482240.0
4-179.917-89.4167-33.610437.0
5-179.917-89.25-33.719940.0

Note that the resulting DataFrame will include the values set to nothing in the layers. We might want to remove those rows using filter!:

filter!(x -> !isnothing(x.temperature) && !isnothing(x.precipitation), env_df);
last(env_df, 5)

5 rows × 4 columns

longitudelatitudetemperatureprecipitation
Float64Float64Union…Union…
1179.91770.9167-11.1038153.0
2179.91771.0833-12.7957161.0
3179.91771.25-12.8151148.0
4179.91771.4167-12.3703136.0
5179.91771.5833-12.3328122.0

GBIF.jl allows us to convert a set of occurrences to a DataFrame just as easily:

kf_df = DataFrame(kf_occurrences)
last(kf_df, 5)

5 rows × 18 columns

keynamedatasetpublished_inobserved_inlatitudelongitudedaterankobserverlicensekingdomphylumclassorderfamilygenusspecies
Int64Abstrac…?Abstrac…?Abstrac…?Abstrac…?Abstrac…?Abstrac…?DateTim…?Abstrac…?Abstrac…?Abstrac…?String?String?String?String?String?String?String?
12573871865Megaceryle alcyoniNaturalist research-grade observationsUSMX23.113-106.282020-02-11T14:05:00SPECIESFRANCISCO MIGUEL FARRIOLS ESTRADAhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
22573872043Megaceryle alcyoniNaturalist research-grade observationsUSUS29.3455-98.45042020-02-14T09:58:00SPECIESPeter Josephhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
32573873683Megaceryle alcyoniNaturalist research-grade observationsUSUS31.3791-94.72672020-02-10T00:00:00SPECIEStnewmanhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
42573873789Megaceryle alcyoniNaturalist research-grade observationsUSUS26.229-97.34732020-02-11T12:42:57SPECIESola10birderhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
52573874984Megaceryle alcyoniNaturalist research-grade observationsUSUS31.0323-97.42132020-02-23T16:20:00SPECIESA. Flinn (and M. Finn)http://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon

We can then extract the temperature values for all the occurrences.

temperature[kf_df]
1000-element Array{Float32,1}:
 18.931166
 18.931166
 17.857887
 17.08904
  9.419209
 12.142196
 10.534634
 12.142196
 24.394215
 14.874448
  ⋮
 18.901596
 10.245042
 23.56406
  9.809229
 25.076925
 20.466646
 18.83451
 22.63053
 19.26551

Or we can clip the layers according to the occurrences:

temperature_clip = clip(temperature, kf_df)
precipitation_clip = clip(precipitation, kf_df)
SDM predictor → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

In case your DataFrame has different column names for the coordinates, for example :lat and :lon, you can clip it like this:

kf_df_shortnames = rename(kf_df, :latitude => :lat, :longitude => :lon)
clip(temperature, kf_df_shortnames; latitude = :lat, longitude = :lon)
SDM predictor → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

We can finally plot the layer and occurrence values in a similar way to any DataFrame or Array.

histogram2d(temperature_clip, precipitation_clip, c = :viridis)
scatter!(temperature_clip[kf_df], precipitation_clip[kf_df],
         lab= "", c = :white, msc = :orange)

To plot the occurrence values over space, you can use:

contour(temperature_clip, c = :alpine, title = "Temperature",
        frame = :box, fill = true)
scatter!(kf_df.longitude, kf_df.latitude,
         lab = "", c = :white, msc = :orange, ms = 2)

We can finally make a layer with the number of observations per cells:

abundance = mask(precipitation_clip, kf_occurrences, Float32)
SDM response → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

A useful trick to visualize sites with occurrences, in contrast with sites without any occurrence, is to use replace or replace! to set the values returned as 0 or true by the function mask() to nothing. This allows us to first plot a background layer with a uniform colour, covering the whole area to visualize, then plot the occurrence layer on top using a different colour scale.

abundance_nozeros = replace(abundance, 0 => nothing)
plot(precipitation_clip, c = :lightgrey)
plot!(abundance_nozeros, c = :viridis, clim = extrema(abundance_nozeros))

Once again, the cells are rather small, and there are few observations, so this is not necessarily going to be very informative. As in our other example, to get a better sense of the distribution of observations, we can get the average number of observations in a radius of 100km around each cell (we will do so for a zoomed-in part of the map to save time):

zoom = abundance[left = -100.0, right = -75.0, top = 43.0, bottom = 20.0]
buffered = slidingwindow(zoom, Statistics.mean, 100.0)
plot(buffered, c = :lapaz, legend = false, frame = :box)
scatter!(kf_df.longitude, kf_df.latitude,
         lab = "", c = :white, msc = :orange, ms = 2, alpha = 0.5)